[PyTorch] Enable head dim 256 for FA4 #2932
Conversation
Force-pushed from bdcc02e to 3b3f7d0
Greptile Summary

This PR enables head_dim=256 support for FlashAttention 4 by delegating head-dimension validation to FA4's own validation function (`v4_validate_head_dims`) instead of a static head-dim table.
Confidence Score: 4/5

Safe to merge on SM100 hardware; the static head-dim table replacement and dedicated hd256 test are well-structured, but the single FA4 import statement still hard-fails for any FA4 build that does not export the validation helper. The core logic change, delegating head-dim validation to FA4's own function, is correct, and the SM100-gated test properly signals intent. The remaining concern is in transformer_engine/pytorch/attention/dot_product_attention/backends.py: the grouped FA4 import at lines 167–171 is the single point where a missing export causes a hard failure.

Important Files Changed
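To make the import concern concrete, here is a minimal sketch of a per-symbol import that degrades to None instead of hard-failing. The module path `flash_attn.cute.interface` and the helper name `validate_head_dims` are assumptions for illustration, not the actual import in backends.py.

```python
# Hypothetical guard (module path and symbol name are assumptions): resolve
# FA4's head-dim validator if the installed build exports it, otherwise fall
# back to None so backend selection can skip the FA4-specific check.
try:
    from flash_attn.cute.interface import validate_head_dims as v4_validate_head_dims
except ImportError:
    v4_validate_head_dims = None
```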
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[get_attention_backend called] --> B{use_flash_attention_4\nAND v4_is_installed\nAND v4_validate_head_dims ≠ None?}
B -- No --> Z[Skip FA4 head-dim validation]
B -- Yes --> C[Compute _fa4_alignment\n= 16 // element_size]
C --> D[Call v4_validate_head_dims\nhead_dim_qk, head_dim_v,\nsm_major, alignment]
D -- AssertionError --> E[log: unsupported head dims\nuse_flash_attention_4 = False]
D -- OK --> F{SM100 AND\nhd_qk == hd_v == 256\nAND max_seqlen_q != max_seqlen_kv?}
F -- Yes --> G[log: hd256 cross-attn fallback\nuse_flash_attention_4 = False]
F -- No --> H{is_training AND\nhd_qk != hd_v AND\nhd_qk >= 128 AND SM100?}
H -- Yes --> I{dV TMEM misalignment?\ntile_hdimv//2 % dk_reduce_ncol != 0}
I -- Yes --> J[log: MLA dV bug\nuse_flash_attention_4 = False]
I -- No --> K[FA4 enabled]
H -- No --> K
E --> L[Fall back to other backend]
G --> L
J --> L
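The same decision flow, condensed into a Python sketch. The argument order of `v4_validate_head_dims` mirrors the chart but is an assumption, and `tile_hdimv` / `dk_reduce_ncol` stand in for kernel-internal values from FA4's SM100 backward pass rather than anything computed here.

```python
def fa4_head_dim_ok(
    head_dim_qk, head_dim_v, element_size, sm_major,
    max_seqlen_q, max_seqlen_kv, is_training,
    v4_validate_head_dims, tile_hdimv, dk_reduce_ncol,
):
    """Sketch of the FA4 gating above; names and call shapes are assumptions."""
    if v4_validate_head_dims is None:
        return True  # validator unavailable: skip FA4-specific head-dim checks

    alignment = 16 // element_size  # e.g. 8 for fp16/bf16
    try:
        # assumed call shape, mirroring the flowchart
        v4_validate_head_dims(head_dim_qk, head_dim_v, sm_major, alignment)
    except AssertionError:
        return False  # unsupported head dims: fall back to another backend

    is_sm100 = sm_major == 10
    # hd256 cross-attention fallback on SM100
    if is_sm100 and head_dim_qk == head_dim_v == 256 and max_seqlen_q != max_seqlen_kv:
        return False

    # MLA backward dV TMEM misalignment (manually filtered FA4 bug)
    if is_training and head_dim_qk != head_dim_v and head_dim_qk >= 128 and is_sm100:
        if (tile_hdimv // 2) % dk_reduce_ncol != 0:
            return False

    return True
```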
Reviews (6): Last reviewed commit: "Merge branch 'main' into xiny/headdim256..."
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
/te-ci pytorch L3
@vcherepanov-nv @KshitijLakhani Please review.
# dV TMEM load atoms. When (tile_hdimv // 2) % dK_reduce_ncol != 0, dV reads are
# misaligned. The dedicated (256, 256) kernel uses its own tmem layout so it's
# not affected. See: flash_attn/cute/flash_bwd_sm100.py, line ~262 and ~3890.
if (
Should this still be checked when FlashAttentionUtils.v4_validate_head_dims == None?
I double-checked that this is a bug in FA4. Kernels produce wrong results on these shapes, but they're allowed by v4_validate_head_dims, so we have to filter them out manually.
Raised an issue to FA4: Dao-AILab/flash-attention#2552
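For reference, the manual filter discussed above reduces to the predicate below. `tile_hdimv` and `dk_reduce_ncol` are kernel-internal values from flash_attn/cute/flash_bwd_sm100.py, so any concrete values fed in would be assumptions.

```python
def dv_tmem_misaligned(tile_hdimv: int, dk_reduce_ncol: int) -> bool:
    # True when the dV TMEM load atoms don't divide evenly, i.e. the shapes
    # that v4_validate_head_dims still accepts but that produce wrong dV.
    return (tile_hdimv // 2) % dk_reduce_ncol != 0
```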
LGTM |
Signed-off-by: Xin Yao <xiny@nvidia.com>
/te-ci pytorch L3
Signed-off-by: Xin Yao <xiny@nvidia.com>
/te-ci pytorch L3 |
Description
Need FA4 version 4.0.0b11.
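As a quick sanity check for this requirement, something like the following could verify the installed build; the distribution name "flash-attn" is an assumption about how FA4 is packaged.

```python
# Hedged sketch: confirm the installed FlashAttention meets the 4.0.0b11 minimum.
from importlib.metadata import version, PackageNotFoundError
from packaging.version import Version

try:
    fa4_ok = Version(version("flash-attn")) >= Version("4.0.0b11")
except PackageNotFoundError:
    fa4_ok = False
print("FA4 >= 4.0.0b11:", fa4_ok)
```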
Type of change
Changes

Please list the changes introduced in this PR:

- Delegate FA4 head-dim validation to FA4's own validation function instead of the static head-dim table.
- Manually filter out head-dim combinations that hit the FA4 SM100 backward dV TMEM misalignment bug (Dao-AILab/flash-attention#2552).
- Add a dedicated SM100-gated test for head_dim=256.
Checklist: